Counting Lumps in Word Space: Density as a Measure of Corpus Homogeneity
نویسندگان
چکیده
This paper introduces a measure of corpus homogeneity that indicates the amount of topical dispersion in a corpus. The measure is based on the density of neighborhoods in semantic word spaces. We evaluate the measure by comparing the results for five different corpora. Our initial results indicate that the proposed density measure can indeed identify differences in topical dispersion.
منابع مشابه
Using Word Frequency Lists to Measure Corpus Homogeneity and Similarity between Corpora
How similar are two corpora? A measure of corpus similarity would be very useful for lexicography and language engineering. Word frequency lists are cheap and easy to generate so a measure based on them would be of use as a quick guide in many circumstances; for example, to judge how a newly available corpus related to existing resources, or how easy it might be to port an NLP system designed t...
متن کاملITRI-97-07 Using Word Frequency Lists to Measure Corpus Homogeneity and Similarity between Corpora
How similar are two corpora? A measure of corpus similarity would be very useful for language engineering. Word frequency lists are cheap and easy to generate so a measure based on them would be of use as a quick guide in many circumstances; for example, to judge how a newly available corpus related to existing resources, or how easy it might be to port an NLP system designed to work with one t...
متن کاملVocabulary Lists for EAP and Conversation Students
Despite the abundance of research investigating general and academic vocabularies and developing dozens of word lists, few studies have compared academic vocabulary with general service word lists such as conversation vocabulary. Many EAP researchers assume that university students need to know all the words in West’s (1953) General Service List (GSL) as a prerequisite to academic words (e.g., ...
متن کاملDeveloping a Corpus-Based Word List in Pharmacy Research Articles: A Focus on Academic Culture
The present corpus-based lexical study reports the development of a Pharmacy Academic Word List (PAWL); a list of the most frequent words from a corpus of 3,458,445 tokens made up of 800 most recent pharmacy texts including research articles, review articles, and short communications in four sub-disciplines of pharmacy. WordSmith (Scott, 2017) and AntWordProfiler (Anthony, 2014) were used to sc...
متن کاملMeasures for Corpus Similarity and Homogeneity
How similar are two corpora? A measure of corpus similarity would be very useful for NLP for many purposes, such as estimating the work involved in porting a system from one domain to another. First, we discuss difficulties in identifying what we mean by 'corpus similariti: human similarity judgements are not finegrained enough, corpus similarity is inherently multidimensional, and similarity c...
متن کامل